
Success in any financial market requires identifying solid investments. When a stock or derivative is undervalued, it makes sense to buy; if it is overvalued, it may be time to sell. While these decisions were historically made manually by professionals, technology has opened new opportunities for retail investors. Data scientists in particular may be interested in quantitative trading, where decisions are executed programmatically based on predictions from trained models. In this competition, financial data for the Japanese market is provided, allowing retail investors to analyze the market to the fullest extent.
Japan Exchange Group, Inc. (JPX) is a holding company operating one of the largest stock exchanges in the world, Tokyo Stock Exchange (TSE), and derivatives exchanges Osaka Exchange (OSE) and Tokyo Commodity Exchange (TOCOM). JPX is hosting this competition and is supported by AI technology company AlpacaJapan Co., Ltd.
This competition involves building portfolios from the stocks eligible for prediction. Specifically, each participant ranks the stocks from highest to lowest expected return and is evaluated on the difference in returns between the top and bottom 50 stocks. You'll have access to financial data from the Japanese market, such as stock information and historical stock prices, to train and test your model. After the training phase is complete, the competition will compare your models against real future returns.
The training data begins in January 2017 with about 1,860 stocks; new stocks are added through December 2020, bringing the total to 2,000. Below is a brief overview and descriptive statistics of the variables in the training data.
conda install -c plotly plotly
conda install -c conda-forge lightgbm
#upload the kaggle.json API token via the Colab file-upload widget
#(the google.colab module is preinstalled in Colab, so no pip install is needed)
from google.colab import files
files.upload()
#install the Kaggle API client
!pip install -q kaggle
#place the API token where the kaggle tool expects it
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
#restricting the token's permissions avoids a warning on kaggle tool startup
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets list -s jpx-tokyo-stock-exchange-prediction
#Download Data
!kaggle competitions download -c jpx-tokyo-stock-exchange-prediction
#create new directory
!mkdir jtmp1
#unzip jpx tokyo exchange file inside jtmp1 folder
!unzip jpx-tokyo-stock-exchange-prediction.zip -d jtmp1
#Remove the Zip files
!rm /content/jpx-tokyo-stock-exchange-prediction.zip
import warnings, gc
import numpy as np
import pandas as pd
import matplotlib.colors
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
from datetime import datetime, timedelta
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error,mean_absolute_error
from lightgbm import LGBMRegressor
from decimal import ROUND_HALF_UP, Decimal
warnings.filterwarnings("ignore")
import plotly.figure_factory as ff
init_notebook_mode(connected=True)
temp = dict(layout=go.Layout(font=dict(family="Franklin Gothic", size=12), width=800))
colors=px.colors.qualitative.Plotly
train=pd.read_csv("F:\\cli\\Data_science_portfolios\\jpx_capstone_projects\\IIM Nagpur Final Project-20220730T163838Z-001\\IIM Nagpur Final Project\\data\\stock_prices.csv",
parse_dates=['Date'])
stock_list=pd.read_csv("F:\\cli\\Data_science_portfolios\\jpx_capstone_projects\\IIM Nagpur Final Project-20220730T163838Z-001\\IIM Nagpur Final Project\\data\\stock_list.csv")
print("The training data begins on {} and ends on {}.\n".format(train.Date.min(),train.Date.max()))
display(train.describe().style.format('{:,.2f}'))
The training data begins on 2017-01-04 00:00:00 and ends on 2021-12-03 00:00:00.
| | SecuritiesCode | Open | High | Low | Close | Volume | AdjustmentFactor | ExpectedDividend | Target |
|---|---|---|---|---|---|---|---|---|---|
| count | 2,332,531.00 | 2,324,923.00 | 2,324,923.00 | 2,324,923.00 | 2,324,923.00 | 2,332,531.00 | 2,332,531.00 | 18,865.00 | 2,332,293.00 |
| mean | 5,894.84 | 2,594.51 | 2,626.54 | 2,561.23 | 2,594.02 | 691,936.56 | 1.00 | 22.02 | 0.00 |
| std | 2,404.16 | 3,577.19 | 3,619.36 | 3,533.49 | 3,576.54 | 3,911,255.94 | 0.07 | 29.88 | 0.02 |
| min | 1,301.00 | 14.00 | 15.00 | 13.00 | 14.00 | 0.00 | 0.10 | 0.00 | -0.58 |
| 25% | 3,891.00 | 1,022.00 | 1,035.00 | 1,009.00 | 1,022.00 | 30,300.00 | 1.00 | 5.00 | -0.01 |
| 50% | 6,238.00 | 1,812.00 | 1,834.00 | 1,790.00 | 1,811.00 | 107,100.00 | 1.00 | 15.00 | 0.00 |
| 75% | 7,965.00 | 3,030.00 | 3,070.00 | 2,995.00 | 3,030.00 | 402,100.00 | 1.00 | 30.00 | 0.01 |
| max | 9,997.00 | 109,950.00 | 110,500.00 | 107,200.00 | 109,550.00 | 643,654,000.00 | 20.00 | 1,070.00 | 1.12 |
train_date=train.Date.unique()
returns=train.groupby('Date')['Target'].mean().mul(100).rename('Average Return')
close_avg=train.groupby('Date')['Close'].mean().rename('Closing Price')
vol_avg=train.groupby('Date')['Volume'].mean().rename('Volume')
fig = make_subplots(rows=3, cols=1,
shared_xaxes=True)
for i, j in enumerate([returns, close_avg, vol_avg]):
    fig.add_trace(go.Scatter(x=train_date, y=j, mode='lines',
                             name=j.name, marker_color=colors[i]), row=i+1, col=1)
fig.update_xaxes(rangeslider_visible=False,
rangeselector=dict(
buttons=list([
dict(count=6, label="6m", step="month", stepmode="backward"),
dict(count=1, label="1y", step="year", stepmode="backward"),
dict(count=2, label="2y", step="year", stepmode="backward"),
dict(step="all")])),
row=1,col=1)
fig.update_layout(template=temp,title='JPX Market Average Stock Return, Closing Price, and Shares Traded',
hovermode='x unified', height=700,
yaxis1=dict(title='JPX Index Return', ticksuffix='%'),
yaxis2_title='Closing Price', yaxis3_title='Shares Traded',
showlegend=False)
fig.show()
The graphs above show the market's average stock return, closing price, and shares traded since January 2017. While there has been much fluctuation over the past four years, the number of shares traded has slightly decreased from the volume in early 2017.
stock_list['SectorName']=[i.rstrip().lower().capitalize() for i in stock_list['17SectorName']]
stock_list['Name']=[i.rstrip().lower().capitalize() for i in stock_list['Name']]
train_df = train.merge(stock_list[['SecuritiesCode','Name','SectorName']], on='SecuritiesCode', how='left')
train_df['Year'] = train_df['Date'].dt.year
years = {year: pd.DataFrame() for year in train_df.Year.unique()[::-1]}
for key in years.keys():
    df = train_df[train_df.Year == key]
    years[key] = df.groupby('SectorName')['Target'].mean().mul(100).rename("Avg_return_{}".format(key))
df = pd.concat((years[i].to_frame() for i in years.keys()), axis=1)
df=df.sort_values(by="Avg_return_2021")
fig = make_subplots(rows=1, cols=5, shared_yaxes=True)
for i, col in enumerate(df.columns):
    x = df[col]
    mask = x <= 0
    fig.add_trace(go.Bar(x=x[mask], y=df.index[mask], orientation='h',
                         text=x[mask], texttemplate='%{text:.2f}%', textposition='auto',
                         hovertemplate='Average Return in %{y} Stocks = %{x:.4f}%',
                         marker=dict(color='red', opacity=0.7), name=col[-4:]),
                  row=1, col=i+1)
    fig.add_trace(go.Bar(x=x[~mask], y=df.index[~mask], orientation='h',
                         text=x[~mask], texttemplate='%{text:.2f}%', textposition='auto',
                         hovertemplate='Average Return in %{y} Stocks = %{x:.4f}%',
                         marker=dict(color='green', opacity=0.7), name=col[-4:]),
                  row=1, col=i+1)
    fig.update_xaxes(range=(x.min()-.15, x.max()+.15), title='{} Returns'.format(col[-4:]),
                     showticklabels=False, row=1, col=i+1)
fig.update_layout(template=temp,title='Yearly Average Stock Returns by Sector',
hovermode='closest',margin=dict(l=250,r=50),
height=600, width=1000, showlegend=False)
fig.show()
In 2021, nearly all industries saw a positive return on average, with the highest in Energy Resources at about 0.13% overall, while in 2018, all sectors saw a negative return except for Electric Power & Gas.
Since some of the stocks were only added in December 2020, we will filter the data to dates after 2020-12-23, so that it contains 231 trading days of prices for all 2,000 stocks.
train_df=train_df[train_df.Date>'2020-12-23']
print("New Train Shape {}.\nMissing values in Target = {}".format(train_df.shape,train_df['Target'].isna().sum()))
New Train Shape (462000, 15). Missing values in Target = 0
fig = go.Figure()
x_hist=train_df['Target']
fig.add_trace(go.Histogram(x=x_hist*100,
marker=dict(color=colors[0], opacity=0.7,
line=dict(width=1, color=colors[0])),
xbins=dict(start=-40,end=40,size=1)))
fig.update_layout(template=temp,title='Target Distribution',
xaxis=dict(title='Stock Return',ticksuffix='%'), height=450)
fig.show()
pal = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, 18)]
fig = go.Figure()
for i, sector in enumerate(df.index[::-1]):
    y_data = train_df[train_df['SectorName']==sector]['Target']
    fig.add_trace(go.Box(y=y_data*100, name=sector,
                         marker_color=pal[i], showlegend=False))
fig.update_layout(template=temp, title='Target Distribution by Sector',
yaxis=dict(title='Target Range',ticksuffix='%'),
margin=dict(b=150), height=750, width=900)
fig.show()
While most sectors' returns fall between -10% and 10%, there are quite a few outliers across all industries, with some returns as high as 62% in Commercial & Wholesale Trade and others as low as -31% in the IT & Services sector. The graph below shows the stock price movements within each sector.
train_date=train_df.Date.unique()
sectors=train_df.SectorName.unique().tolist()
sectors.insert(0, 'All')
open_avg=train_df.groupby('Date')['Open'].mean()
high_avg=train_df.groupby('Date')['High'].mean()
low_avg=train_df.groupby('Date')['Low'].mean()
close_avg=train_df.groupby('Date')['Close'].mean()
buttons=[]
fig = go.Figure()
for i in range(18):
    if i != 0:
        open_avg = train_df[train_df.SectorName==sectors[i]].groupby('Date')['Open'].mean()
        high_avg = train_df[train_df.SectorName==sectors[i]].groupby('Date')['High'].mean()
        low_avg = train_df[train_df.SectorName==sectors[i]].groupby('Date')['Low'].mean()
        close_avg = train_df[train_df.SectorName==sectors[i]].groupby('Date')['Close'].mean()
    fig.add_trace(go.Candlestick(x=train_date, open=open_avg, high=high_avg,
                                 low=low_avg, close=close_avg, name=sectors[i],
                                 visible=(True if i==0 else False)))
    visibility = [False]*len(sectors)
    visibility[i] = True
    button = dict(label=sectors[i], method="update", args=[{"visible": visibility}])
    buttons.append(button)
fig.update_xaxes(rangeslider_visible=True,
rangeselector=dict(
buttons=list([
dict(count=3, label="3m", step="month", stepmode="backward"),
dict(count=6, label="6m", step="month", stepmode="backward"),
dict(step="all")]), xanchor='left',yanchor='bottom', y=1.16, x=.01))
fig.update_layout(template=temp,title='Stock Price Movements by Sector',
hovermode='x unified', showlegend=False, width=1000,
updatemenus=[dict(active=0, type="dropdown",
buttons=buttons, xanchor='left',
yanchor='bottom', y=1.01, x=.01)],
yaxis=dict(title='Stock Price'))
fig.show()
In the candlestick charts above, the boxes represent the daily spread between the open and close prices and the lines represent the spread between the low and high prices. The color of the boxes indicates whether the close price was greater or lower than the open price, with green indicating a higher closing price on that day and red indicating a lower closing price. In late August, the market saw a consecutive 14-day period where the close price was greater than the open price.
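Streaks like the 14-day run mentioned above can be checked directly from the data. Below is a small sketch on synthetic prices, with a hypothetical `longest_up_streak` helper (not part of the notebook), counting consecutive days where the close exceeded the open:

```python
import pandas as pd

def longest_up_streak(close: pd.Series, open_: pd.Series) -> int:
    """Length of the longest run of consecutive days with Close > Open."""
    up = (close > open_).astype(int)
    # each down day starts a new group, so cumsum within a group counts the streak
    groups = (up == 0).cumsum()
    return int(up.groupby(groups).cumsum().max())

# toy example: three up days, a down day, then two up days -> longest streak is 3
close = pd.Series([101, 102, 103, 99, 105, 106])
open_ = pd.Series([100, 101, 102, 100, 104, 105])
print(longest_up_streak(close, open_))  # 3
```

Applied to the market-average open and close series used for the candlesticks, this would locate the streak directly.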
stock=train_df.groupby('Name')['Target'].mean().mul(100)
stock_low=stock.nsmallest(7)[::-1].rename("Return")
stock_high=stock.nlargest(7).rename("Return")
stock=pd.concat([stock_high, stock_low], axis=0).reset_index()
stock['Sector']='All'
for i in train_df.SectorName.unique():
    sector = train_df[train_df.SectorName==i].groupby('Name')['Target'].mean().mul(100)
    stock_low = sector.nsmallest(7)[::-1].rename("Return")
    stock_high = sector.nlargest(7).rename("Return")
    sector_stock = pd.concat([stock_high, stock_low], axis=0).reset_index()
    sector_stock['Sector'] = i
    stock = pd.concat([stock, sector_stock], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
fig=go.Figure()
buttons = []
for i, sector in enumerate(stock.Sector.unique()):
    x = stock[stock.Sector==sector]['Name']
    y = stock[stock.Sector==sector]['Return']
    mask = y > 0
    fig.add_trace(go.Bar(x=x[mask], y=y[mask], text=y[mask],
                         texttemplate='%{text:.2f}%', textposition='auto',
                         name=sector, visible=(False if i != 0 else True),
                         hovertemplate='%{x} average return: %{y:.3f}%',
                         marker=dict(color='green', opacity=0.7)))
    fig.add_trace(go.Bar(x=x[~mask], y=y[~mask], text=y[~mask],
                         texttemplate='%{text:.2f}%', textposition='auto',
                         name=sector, visible=(False if i != 0 else True),
                         hovertemplate='%{x} average return: %{y:.3f}%',
                         marker=dict(color='red', opacity=0.7)))
    visibility = [False]*2*len(stock.Sector.unique())
    visibility[i*2], visibility[i*2+1] = True, True
    button = dict(label=sector, method="update", args=[{"visible": visibility}])
    buttons.append(button)
fig.update_layout(title='Stocks with Highest and Lowest Returns by Sector',
template=temp, yaxis=dict(title='Average Return', ticksuffix='%'),
updatemenus=[dict(active=0, type="dropdown",
buttons=buttons, xanchor='left',
yanchor='bottom', y=1.01, x=.01)],
margin=dict(b=150),showlegend=False,height=700, width=900)
fig.show()
Among stocks with the highest return on average since December 2020, 6 of the 7 were in the IT & Services sector and one was in the Pharmaceutical sector. The IT & Services and Pharmaceutical sectors also make up 6 of the stocks with the lowest returns on average. The graph below shows the relationships between the top 5 stocks with the highest average returns, Enechange Ltd., For Startups, Inc., Symbio Pharmaceuticals, Fronteo, Inc., and Emnet Japan Co. Ltd.
stocks=train_df[train_df.SecuritiesCode.isin([4169,7089,4582,2158,7036])]
df_pivot=stocks.pivot_table(index='Date', columns='Name', values='Close').reset_index()
pal=['rgb'+str(i) for i in sns.color_palette("coolwarm", len(df_pivot))]
fig = ff.create_scatterplotmatrix(df_pivot.iloc[:,1:], diag='histogram', name='')
fig.update_traces(marker=dict(color=pal, opacity=0.9, line_color='white', line_width=.5))
fig.update_layout(template=temp, title='Scatterplots of Highest Performing Stocks',
height=1000, width=1000, showlegend=False)
fig.show()
corr=train_df.groupby('SecuritiesCode')[['Target','Close']].corr().unstack().iloc[:,1]
stocks=corr.nlargest(10).rename("Return").reset_index()
stocks=stocks.merge(train_df[['Name','SecuritiesCode']], on='SecuritiesCode').drop_duplicates()
pal=sns.color_palette("magma_r", 14).as_hex()
rgb=['rgba'+str(matplotlib.colors.to_rgba(i,0.7)) for i in pal]
fig = go.Figure()
fig.add_trace(go.Bar(x=stocks.Name, y=stocks.Return, text=stocks.Return,
texttemplate='%{text:.2f}', name='', width=0.8,
textposition='outside',marker=dict(color=rgb, line=dict(color=pal,width=1)),
hovertemplate='Correlation of %{x} with target = %{y:.3f}'))
fig.update_layout(template=temp, title='Most Correlated Stocks with Target Variable',
yaxis=dict(title='Correlation',showticklabels=False),
xaxis=dict(title='Stock',tickangle=45), margin=dict(b=100),
width=800,height=500)
fig.show()
df_pivot=train_df.pivot_table(index='Date', columns='SectorName', values='Close').reset_index()
corr=df_pivot.corr().round(2)
mask=np.triu(np.ones_like(corr, dtype=bool))
c_mask = np.where(~mask, corr, 100)
c=[]
for i in c_mask.tolist()[1:]:
    c.append([x for x in i if x != 100])
cor=c[::-1]
x=corr.index.tolist()[:-1]
y=corr.columns.tolist()[1:][::-1]
fig=ff.create_annotated_heatmap(z=cor, x=x, y=y,
hovertemplate='Correlation between %{x} and %{y} stocks = %{z}',
colorscale='viridis', name='')
fig.update_layout(template=temp, title='Stock Correlation between Sectors',
margin=dict(l=250,t=270),height=800,width=900,
yaxis=dict(showgrid=False, autorange='reversed'),
xaxis=dict(showgrid=False))
fig.show()
Before creating additional features, I will first generate the adjusted close price using the function provided by the competition hosts in the Train Demo Notebook, which will adjust the close price to account for any stock splits or reverse splits. Using the stock's adjusted close price, I will create a set of new features, including the stock price moving average, exponential moving average, return, and volatility, each over a period of 5, 10, 20, 30, and 50 days. These features are shown in the graphs below across each sector.
def adjust_price(price):
    """
    Args:
        price (pd.DataFrame): DataFrame of stock prices
    Returns:
        price (pd.DataFrame): stock prices with generated AdjustedClose
    """
    # transform Date column into datetime
    price.loc[:, "Date"] = pd.to_datetime(price.loc[:, "Date"], format="%Y-%m-%d")

    def generate_adjusted_close(df):
        """
        Args:
            df (pd.DataFrame): stock prices for a single SecuritiesCode
        Returns:
            df (pd.DataFrame): stock prices with AdjustedClose for a single SecuritiesCode
        """
        # sort data to generate CumulativeAdjustmentFactor
        df = df.sort_values("Date", ascending=False)
        # generate CumulativeAdjustmentFactor
        df.loc[:, "CumulativeAdjustmentFactor"] = df["AdjustmentFactor"].cumprod()
        # generate AdjustedClose
        df.loc[:, "AdjustedClose"] = (
            df["CumulativeAdjustmentFactor"] * df["Close"]
        ).map(lambda x: float(
            Decimal(str(x)).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
        ))
        # reverse order
        df = df.sort_values("Date")
        # to fill AdjustedClose, replace 0 with np.nan
        df.loc[df["AdjustedClose"] == 0, "AdjustedClose"] = np.nan
        # forward fill AdjustedClose
        df.loc[:, "AdjustedClose"] = df.loc[:, "AdjustedClose"].ffill()
        return df

    # generate AdjustedClose per security
    price = price.sort_values(["SecuritiesCode", "Date"])
    price = price.groupby("SecuritiesCode").apply(generate_adjusted_close).reset_index(drop=True)
    return price
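The cumulative-factor trick at the heart of `adjust_price` can be illustrated on a toy split. This is a minimal sketch, assuming (as this toy example does) that the adjustment factor is recorded on the last trading day before the split takes effect:

```python
import pandas as pd

# A stock closes at 100 and 102, then splits 2-for-1 and closes at 51.
# Assumption of this toy example: the 0.5 AdjustmentFactor is recorded on
# the last day before the split takes effect.
toy = pd.DataFrame({"Close": [100.0, 102.0, 51.0],
                    "AdjustmentFactor": [1.0, 0.5, 1.0]})

# the core of adjust_price: cumulate factors from the most recent day backwards
cum = toy["AdjustmentFactor"][::-1].cumprod()[::-1]
toy["AdjustedClose"] = (cum * toy["Close"]).round(1)
print(toy["AdjustedClose"].tolist())  # [50.0, 51.0, 51.0]
```

The pre-split prices are scaled down by the cumulative factor so the adjusted series is continuous across the split, while the post-split price is left unchanged.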
train=train.drop('ExpectedDividend',axis=1).fillna(0)
prices=adjust_price(train)
def create_features(df):
    df = df.copy()
    col = 'AdjustedClose'
    periods = [5, 10, 20, 30, 50]
    for period in periods:
        df.loc[:, "Return_{}Day".format(period)] = df.groupby("SecuritiesCode")[col].pct_change(period)
        df.loc[:, "MovingAvg_{}Day".format(period)] = df.groupby("SecuritiesCode")[col].rolling(window=period).mean().values
        df.loc[:, "ExpMovingAvg_{}Day".format(period)] = df.groupby("SecuritiesCode")[col].ewm(span=period, adjust=False).mean().values
        df.loc[:, "Volatility_{}Day".format(period)] = np.log(df[col]).groupby(df["SecuritiesCode"]).diff().rolling(period).std()
    return df
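The rolling features behave as expected on a simple series. A quick sketch on synthetic data (not the competition set), checking a 5-day return and a 5-day moving average:

```python
import numpy as np
import pandas as pd

# one security with linearly increasing adjusted close prices 1..10
toy = pd.DataFrame({"SecuritiesCode": 1301, "AdjustedClose": np.arange(1.0, 11.0)})
g = toy.groupby("SecuritiesCode")["AdjustedClose"]
toy["Return_5Day"] = g.pct_change(5)          # (p_t - p_{t-5}) / p_{t-5}
toy["MovingAvg_5Day"] = g.rolling(5).mean().values

print(toy.loc[5, "Return_5Day"])     # (6 - 1) / 1 = 5.0
print(toy.loc[4, "MovingAvg_5Day"])  # mean(1..5) = 3.0
```

The first `period` rows of each feature are NaN, which is why the modeling section later calls `dropna()` before fitting.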
price_features=create_features(df=prices)
price_features.drop(['RowId','SupervisionFlag','AdjustmentFactor','CumulativeAdjustmentFactor','Close'],axis=1,inplace=True)
price_names=price_features.merge(stock_list[['SecuritiesCode','Name','SectorName']], on='SecuritiesCode').set_index('Date')
price_names=price_names[price_names.index>='2020-12-29']
price_names.fillna(0, inplace=True)
features=['MovingAvg','ExpMovingAvg','Return', 'Volatility']
names=['Average', 'Exp. Moving Average', 'Period', 'Volatility']
buttons=[]
fig = make_subplots(rows=2, cols=2,
shared_xaxes=True,
vertical_spacing=0.1,
subplot_titles=('Adjusted Close Moving Average',
'Exponential Moving Average',
'Stock Return', 'Stock Volatility'))
for i, sector in enumerate(price_names.SectorName.unique()):
    sector_df = price_names[price_names.SectorName==sector]
    periods = [0, 10, 30, 50]
    colors = px.colors.qualitative.Vivid
    dash = ['solid', 'dash', 'longdash', 'dashdot', 'longdashdot']
    row, col = 1, 1
    for j, (feature, name) in enumerate(zip(features, names)):
        if j >= 2:
            row, periods = 2, [10, 30, 50]
            colors = px.colors.qualitative.Bold[1:]
        if j % 2 == 0:
            col = 1
        else:
            col = 2
        for k, period in enumerate(periods):
            if (k==0) & (j<2):
                plot_data = sector_df.groupby(sector_df.index)['AdjustedClose'].mean().rename('Adjusted Close')
            elif j >= 2:
                plot_data = sector_df.groupby(sector_df.index)['{}_{}Day'.format(feature,period)].mean().mul(100).rename('{}-day {}'.format(period,name))
            else:
                plot_data = sector_df.groupby(sector_df.index)['{}_{}Day'.format(feature,period)].mean().rename('{}-day {}'.format(period,name))
            fig.add_trace(go.Scatter(x=plot_data.index, y=plot_data, mode='lines',
                                     name=plot_data.name, marker_color=colors[k+1],
                                     line=dict(width=2, dash=(dash[k] if j<2 else 'solid')),
                                     showlegend=(True if (j==0) or (j==2) else False), legendgroup=row,
                                     visible=(False if i != 0 else True)), row=row, col=col)
    visibility = [False]*14*len(price_names.SectorName.unique())
    for l in range(i*14, i*14+14):
        visibility[l] = True
    button = dict(label=sector, method="update", args=[{"visible": visibility}])
    buttons.append(button)
fig.update_layout(title='Stock Price Moving Average, Return,<br>and Volatility by Sector',
template=temp, yaxis3_ticksuffix='%', yaxis4_ticksuffix='%',
legend_title_text='Period', legend_tracegroupgap=250,
updatemenus=[dict(active=0, type="dropdown",
buttons=buttons, xanchor='left',
yanchor='bottom', y=1.105, x=.01)],
hovermode='x unified', height=800,width=1200, margin=dict(t=150))
fig.show()
In the graphs of the stock price Moving Averages (MA) and Exponential Moving Averages (EMA), when the shorter-period average (the 10-day) crosses below the longer-period average (the 50-day), the closing price tends to decrease, which is typically read as a sell signal (a "death cross"). Conversely, when the 10-day MA/EMA crosses above the 50-day MA/EMA, the closing price tends to increase, which typically indicates a buy signal (a "golden cross"). In addition, the Exponential Moving Average responds to stock price changes more quickly because it puts greater weight on more recent observations. In the graphs of the Stock Return, there is greater fluctuation over longer periods, while Stock Volatility tends to be more stable over time, with more fluctuation over shorter periods.
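The crossover rule described above can be expressed directly in pandas. A sketch using shortened 2-day/4-day windows on synthetic prices (the notebook itself uses 10- and 50-day averages):

```python
import pandas as pd

price = pd.Series([10, 9, 8, 7, 8, 9, 10, 11], dtype=float)
short_ma = price.rolling(2).mean()
long_ma = price.rolling(4).mean()

diff = short_ma - long_ma
# golden cross (buy): short-period average crosses above the long-period one
golden = (diff > 0) & (diff.shift() <= 0)
# death cross (sell): short-period average crosses below the long-period one
death = (diff < 0) & (diff.shift() >= 0)
print(golden[golden].index.tolist())  # [5] -- the day the downtrend reverses
```

On this toy series the short average overtakes the long one on day 5, right after the price bottoms out, which is the behavior the sector charts show at a larger scale.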
Submissions are evaluated on the Sharpe Ratio of the daily spread of returns. For each forecast day, all active stocks are ranked in order of their predicted return. The returns for a single day treat the 50 highest-ranked stocks (ranks 0 to 49) as purchased and the 50 lowest-ranked stocks (e.g. ranks 1950 to 1999) as shorted. The stocks are then weighted based on their ranks, and the total return for the portfolio is calculated assuming the stocks were purchased the next day and sold the day after that.
Since risk control is also an important element of investment, the competition score is the mean divided by the standard deviation of the time series of daily spread returns, rather than the simple mean or sum of daily spread returns. This rewards a model that responds to changes in the distribution of the data and delivers stable, high performance, rather than one that only wins big on certain days.
def calc_spread_return_sharpe(df: pd.DataFrame, portfolio_size: int = 50, toprank_weight_ratio: float = 2) -> float:
    """
    Args:
        df (pd.DataFrame): predicted results
        portfolio_size (int): # of equities to buy/sell
        toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least
    Returns:
        (float): sharpe ratio
    """
    def _calc_spread_return_per_day(df, portfolio_size, toprank_weight_ratio):
        """
        Args:
            df (pd.DataFrame): predicted results for a single day
            portfolio_size (int): # of equities to buy/sell
            toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least
        Returns:
            (float): spread return
        """
        assert df['Rank'].min() == 0
        assert df['Rank'].max() == len(df['Rank']) - 1
        weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
        purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
        short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
        return purchase - short

    buf = df.groupby('Date').apply(_calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio
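As a sanity check, the metric can be exercised on synthetic rankings. The snippet below is a self-contained mirror of the function above (not the official implementation), with targets engineered purely for illustration:

```python
import numpy as np
import pandas as pd

def spread_return_sharpe(df, portfolio_size=50, toprank_weight_ratio=2):
    # self-contained mirror of calc_spread_return_sharpe for this sketch
    def per_day(d):
        w = np.linspace(toprank_weight_ratio, 1, portfolio_size)
        top = d.sort_values("Rank")["Target"].to_numpy()[:portfolio_size]
        bot = d.sort_values("Rank", ascending=False)["Target"].to_numpy()[:portfolio_size]
        return (top * w).sum() / w.mean() - (bot * w).sum() / w.mean()
    daily = df.groupby("Date").apply(per_day)
    return daily.mean() / daily.std()

# three synthetic days engineered so the daily spread returns are 1, 2 and 3:
# on each day the 50 top-ranked stocks all return s/50 and the rest return 0,
# which makes the weighted top-minus-bottom spread exactly s
frames = []
for day, s in enumerate([1.0, 2.0, 3.0]):
    target = np.zeros(200)
    target[:50] = s / 50
    frames.append(pd.DataFrame({"Date": day, "Rank": np.arange(200), "Target": target}))

score = spread_return_sharpe(pd.concat(frames))
print(round(score, 6))  # mean([1,2,3]) / std([1,2,3], ddof=1) = 2.0
```

Because every top-ranked stock has the same return, the rank weights cancel and the spread equals s exactly, so the score reduces to the mean over standard deviation of [1, 2, 3].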
ts_fold = TimeSeriesSplit(n_splits=20, gap=20000)
prices = price_features.dropna().sort_values(['Date', 'SecuritiesCode'])
y = prices['Target'].to_numpy()
X = prices.drop(['Target'], axis=1)
fold_ranks = []
feat_importance = pd.DataFrame()
sharpe_ratio = []

for fold, (train_idx, val_idx) in enumerate(ts_fold.split(X, y)):
    print("\n========================== Fold {} ==========================".format(fold+1))
    # Split into Train and Validation
    X_train, y_train = X.iloc[train_idx, :], y[train_idx]
    X_valid, y_val = X.iloc[val_idx, :], y[val_idx]
    print("Train Date range: {} to {}".format(X_train.Date.min(), X_train.Date.max()))
    print("Valid Date range: {} to {}".format(X_valid.Date.min(), X_valid.Date.max()))

    # Train part: drop identifier columns not used as features
    X_train.drop(['Date', 'SecuritiesCode'], axis=1, inplace=True)
    # Validation part
    X_val = X_valid[X_valid.columns[~X_valid.columns.isin(['Date', 'SecuritiesCode'])]]
    # trim the first and last validation dates
    val_dates = X_valid.Date.unique()[1:-1]

    # Check shape of train and validation sets
    print("\nTrain Shape: {} {}, Valid Shape: {} {}".format(X_train.shape, y_train.shape,
                                                           X_val.shape, y_val.shape))
    params = {'n_estimators': 500,
              'num_leaves': 100,
              'learning_rate': 0.1,
              'colsample_bytree': 0.9,
              'subsample': 0.8,
              'reg_alpha': 0.4,
              'metric': 'mae',
              'random_state': 21}

    # Fit the LightGBM gradient boosting model
    # (note: lightgbm>=4 replaces fit's `verbose` with log_evaluation callbacks)
    gbm = LGBMRegressor(**params).fit(X_train, y_train,
                                      eval_set=[(X_train, y_train), (X_val, y_val)],
                                      verbose=300,
                                      eval_metric=['mae', 'mse'])
    # Predict on the validation set and measure the error
    y_pred = gbm.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    mae = mean_absolute_error(y_val, y_pred)
    # Record feature importances for this fold
    feat_importance["Importance_Fold"+str(fold)] = gbm.feature_importances_
    feat_importance.set_index(X_train.columns, inplace=True)

    # Rank the stocks by predicted return on each validation day
    rank = []
    X_val_df = X_valid[X_valid.Date.isin(val_dates)]
    for i in X_val_df.Date.unique():
        temp_df = X_val_df[X_val_df.Date == i].drop(['Date', 'SecuritiesCode'], axis=1)
        temp_df["pred"] = gbm.predict(temp_df)
        temp_df["Rank"] = (temp_df["pred"].rank(method="first", ascending=False) - 1).astype(int)
        rank.append(temp_df["Rank"].values)

    stock_rank = pd.Series([x for y in rank for x in y], name="Rank")
    df = pd.concat([X_val_df.reset_index(drop=True), stock_rank,
                    prices[prices.Date.isin(val_dates)]['Target'].reset_index(drop=True)], axis=1)
    sharpe = calc_spread_return_sharpe(df)
    sharpe_ratio.append(sharpe)
    print("Valid Sharpe: {}, RMSE: {}, MAE: {}".format(sharpe, rmse, mae))
    fold_ranks.append(stock_rank)
    del X_train, y_train, X_val, y_val
    gc.collect()

print("\nAverage cross-validation Sharpe Ratio: {:.4f}, standard deviation = {:.2f}.".format(
    np.mean(sharpe_ratio), np.std(sharpe_ratio)))
========================== Fold 1 ========================== Train Date range: 2017-03-16 00:00:00 to 2017-05-25 00:00:00 Valid Date range: 2017-06-08 00:00:00 to 2017-08-30 00:00:00 Train Shape: (86307, 25) (86307,), Valid Shape: (106298, 25) (106298,) [300] training's l2: 0.000262773 training's l1: 0.0111506 valid_1's l2: 0.000351769 valid_1's l1: 0.0118444 Valid Sharpe: 0.10088315566046077, RMSE: 0.01889541195790939, MAE: 0.011989282886293917 ========================== Fold 2 ========================== Train Date range: 2017-03-16 00:00:00 to 2017-08-16 00:00:00 Valid Date range: 2017-08-30 00:00:00 to 2017-11-22 00:00:00 Train Shape: (192605, 25) (192605,), Valid Shape: (106298, 25) (106298,) [300] training's l2: 0.000283663 training's l1: 0.0112468 valid_1's l2: 0.000407456 valid_1's l1: 0.0128091 Valid Sharpe: 0.3303869365808252, RMSE: 0.0202692374860529, MAE: 0.012882969147672285 ========================== Fold 3 ========================== Train Date range: 2017-03-16 00:00:00 to 2017-11-07 00:00:00 Valid Date range: 2017-11-22 00:00:00 to 2018-02-16 00:00:00 Train Shape: (298903, 25) (298903,), Valid Shape: (106298, 25) (106298,) [300] training's l2: 0.00029751 training's l1: 0.011379 valid_1's l2: 0.00054228 valid_1's l1: 0.0149695 Valid Sharpe: 0.05000360868201432, RMSE: 0.0233452319187533, MAE: 0.01502154210166262 ========================== Fold 4 ========================== Train Date range: 2017-03-16 00:00:00 to 2018-02-01 00:00:00 Valid Date range: 2018-02-16 00:00:00 to 2018-05-11 00:00:00 Train Shape: (405201, 25) (405201,), Valid Shape: (106298, 25) (106298,) [300] training's l2: 0.000319805 training's l1: 0.0117803 valid_1's l2: 0.000475657 valid_1's l1: 0.0145612 Valid Sharpe: 0.2920842182567579, RMSE: 0.021878269750662774, MAE: 0.014619318288243194 ========================== Fold 5 ========================== Train Date range: 2017-03-16 00:00:00 to 2018-04-24 00:00:00 Valid Date range: 2018-05-11 00:00:00 to 2018-07-31 00:00:00 Train Shape: 
Walk-forward cross-validation results (LightGBM, metrics logged at boosting iteration [300]; every training window starts 2017-03-16, all data have 25 feature columns, and each validation fold holds 106,298 rows). The RMSE and MAE columns are the validation-set l2/l1 losses; fold 5's date ranges are truncated in the log above.

| Fold | Train end | Validation window | Train rows | Valid Sharpe | RMSE | MAE |
|---|---|---|---|---|---|---|
| 5 | … | … | 511,499 | -0.0029 | 0.0208 | 0.0141 |
| 6 | 2018-07-17 | 2018-07-31 to 2018-10-22 | 617,797 | -0.0326 | 0.0229 | 0.0156 |
| 7 | 2018-10-05 | 2018-10-22 to 2019-01-17 | 724,095 | -0.0143 | 0.0296 | 0.0207 |
| 8 | 2018-12-26 | 2019-01-17 to 2019-04-08 | 830,393 | 0.2819 | 0.0229 | 0.0154 |
| 9 | 2019-03-22 | 2019-04-08 to 2019-07-02 | 936,691 | 0.2585 | 0.0224 | 0.0151 |
| 10 | 2019-06-17 | 2019-07-02 to 2019-09-20 | 1,042,989 | 0.1255 | 0.0207 | 0.0142 |
| 11 | 2019-09-04 | 2019-09-20 to 2019-12-11 | 1,149,287 | 0.2499 | 0.0192 | 0.0127 |
| 12 | 2019-11-27 | 2019-12-11 to 2020-03-06 | 1,255,585 | 0.0802 | 0.0262 | 0.0174 |
| 13 | 2020-02-19 | 2020-03-06 to 2020-05-28 | 1,361,883 | 0.0649 | 0.0388 | 0.0280 |
| 14 | 2020-05-13 | 2020-05-28 to 2020-08-14 | 1,468,181 | -0.1653 | 0.0271 | 0.0189 |
| 15 | 2020-07-30 | 2020-08-14 to 2020-11-02 | 1,574,479 | 0.0911 | 0.0231 | 0.0158 |
| 16 | 2020-10-19 | 2020-11-02 to 2021-01-22 | 1,680,777 | -0.0866 | 0.0236 | 0.0160 |
| 17 | 2021-01-07 | 2021-01-22 to 2021-04-09 | 1,787,075 | 0.1040 | 0.0229 | 0.0161 |
| 18 | 2021-03-26 | 2021-04-09 to 2021-06-29 | 1,893,373 | 0.1999 | 0.0210 | 0.0144 |
| 19 | 2021-06-15 | 2021-06-29 to 2021-09-15 | 1,999,671 | -0.0569 | 0.0210 | 0.0141 |
| 20 | 2021-09-01 | 2021-09-15 to 2021-12-03 | 2,105,969 | 0.0297 | 0.0231 | 0.0159 |

Average cross-validation Sharpe ratio: 0.095 (standard deviation 0.134).
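The fold boundaries above come from an expanding-window (walk-forward) split: every fold trains from 2017-03-16 up to a cut-off date and validates on the next window of trading days, so the training set grows while each validation set stays the same size. A minimal sketch of such a splitter; the function name and the equal-chunk rule are illustrative assumptions, not the exact splitter used in this notebook.

```python
import pandas as pd

def expanding_window_splits(dates: pd.Series, n_folds: int):
    """Yield (train_mask, valid_mask) boolean masks over `dates`.

    The unique trading days are cut into n_folds + 1 equal chunks;
    fold k trains on chunks 0..k and validates on chunk k + 1, so the
    training window expands while every validation window has the
    same number of trading days.
    """
    unique_days = sorted(dates.unique())
    fold_size = len(unique_days) // (n_folds + 1)
    for k in range(n_folds):
        train_end = unique_days[(k + 1) * fold_size - 1]
        valid_end = unique_days[min((k + 2) * fold_size, len(unique_days)) - 1]
        train_mask = dates <= train_end
        valid_mask = (dates > train_end) & (dates <= valid_end)
        yield train_mask, valid_mask
```

In a real run `dates` would be the per-row trade date column, and each mask pair would select the train/valid rows for one LightGBM fit.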
gc.collect()
0
# Average feature importance across the CV folds, shown as a lollipop chart
feat_importance['avg'] = feat_importance.mean(axis=1)
feat_importance = feat_importance.sort_values(by='avg', ascending=True)
pal = sns.color_palette("plasma_r", 29).as_hex()[2:]

fig = go.Figure()
# One horizontal stem per feature, from 0 out to its average importance
for i in range(len(feat_importance.index)):
    fig.add_shape(dict(type="line", y0=i, y1=i, x0=0, x1=feat_importance['avg'].iloc[i],
                       line_color=pal[::-1][i], opacity=0.7, line_width=4))
# Marker at the tip of each stem, with a hover label showing the value
fig.add_trace(go.Scatter(x=feat_importance['avg'], y=feat_importance.index, mode='markers',
                         marker_color=pal[::-1], marker_size=8,
                         hovertemplate='%{y} Importance = %{x:.0f}<extra></extra>'))
fig.update_layout(template=temp, title='Overall Feature Importance',
                  xaxis=dict(title='Average Importance', zeroline=False),
                  yaxis_showgrid=False, margin=dict(l=120, t=80),
                  height=700, width=800)
fig.show()
import joblib

# Persist the trained LightGBM model so it can be reloaded for inference
export_path = "F:\\cli\\Data_science_portfolios\\jpx_capstone_projects\\IIM Nagpur Final Project-20220730T163838Z-001\\IIM Nagpur Final Project\\model1\\"
filename = "regressor_model_new.sav"
joblib.dump(gbm, export_path + filename)
['F:\\cli\\Data_science_portfolios\\jpx_capstone_projects\\IIM Nagpur Final Project-20220730T163838Z-001\\IIM Nagpur Final Project\\model1\\regressor_model_new.sav']
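Persisting the model with joblib means it can be reloaded later and must reproduce the original predictions exactly. A round-trip sanity check of that pattern; the `TinyModel` class is a hypothetical stand-in for the trained `gbm` booster, used only so the sketch is self-contained.

```python
import joblib
import numpy as np

class TinyModel:
    """Hypothetical stand-in for the trained gbm: predicts coef*x + intercept."""
    def __init__(self, coef, intercept):
        self.coef, self.intercept = coef, intercept
    def predict(self, X):
        return X.ravel() * self.coef + self.intercept

model = TinyModel(2.0, 1.0)

# Same dump/load pattern as joblib.dump(gbm, export_path + filename) above
joblib.dump(model, "stand_in_model.sav")
restored = joblib.load("stand_in_model.sav")

# The restored model must give identical predictions to the original
X = np.arange(5, dtype=float).reshape(-1, 1)
assert np.allclose(restored.predict(X), model.predict(X))
```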
import pandas as pd
op= pd.read_csv("F:\\cli\\Data_science_portfolios\\jpx_capstone_projects\\IIM Nagpur Final Project-20220730T163838Z-001\\IIM Nagpur Final Project\\submission\\Final_submission.csv")
op
| | Date | SecuritiesCode | Rank |
|---|---|---|---|
| 0 | 2021-12-06 | 1301 | 0 |
| 1 | 2021-12-06 | 1332 | 1 |
| 2 | 2021-12-06 | 1333 | 2 |
| 3 | 2021-12-06 | 1375 | 3 |
| 4 | 2021-12-06 | 1376 | 4 |
| ... | ... | ... | ... |
| 111995 | 2022-02-28 | 9990 | 1995 |
| 111996 | 2022-02-28 | 9991 | 1996 |
| 111997 | 2022-02-28 | 9993 | 1997 |
| 111998 | 2022-02-28 | 9994 | 1998 |
| 111999 | 2022-02-28 | 9997 | 1999 |
112000 rows × 3 columns
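The `Rank` column above is simply the per-date ordering of the model's predicted returns: rank 0 is the stock expected to perform best that day, rank 1999 the worst. A sketch of how predictions could be converted into this format; the `Prediction` column name and `add_rank` helper are assumptions for illustration.

```python
import pandas as pd

def add_rank(df: pd.DataFrame) -> pd.DataFrame:
    """Rank stocks 0..N-1 within each Date, best predicted return first."""
    out = df.copy()
    out["Rank"] = (
        out.groupby("Date")["Prediction"]
           .rank(ascending=False, method="first")  # ties broken by row order
           .astype(int)
        - 1
    )
    return out

demo = pd.DataFrame({
    "Date": ["2021-12-06"] * 3,
    "SecuritiesCode": [1301, 1332, 1333],
    "Prediction": [0.012, 0.034, -0.005],
})
ranked = add_rank(demo)
# 1332 has the highest predicted return, so it receives rank 0
```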
sl=pd.read_csv("C:\\Users\\PC\\stock_list.csv")
sl
| | SecuritiesCode | EffectiveDate | Name | Section/Products | NewMarketSegment | 33SectorCode | 33SectorName | 17SectorCode | 17SectorName | NewIndexSeriesSizeCode | NewIndexSeriesSize | TradeDate | Close | IssuedShares | MarketCapitalization | Universe0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1301 | 20211230 | KYOKUYO CO.,LTD. | First Section (Domestic) | Prime Market | 50 | Fishery, Agriculture and Forestry | 1 | FOODS | 7 | TOPIX Small 2 | 20211230.0 | 3080.0 | 1.092828e+07 | 3.365911e+10 | True |
| 1 | 1305 | 20211230 | Daiwa ETF-TOPIX | ETFs/ ETNs | NaN | - | - | - | - | - | - | 20211230.0 | 2097.0 | 3.634636e+09 | 7.621831e+12 | False |
| 2 | 1306 | 20211230 | NEXT FUNDS TOPIX Exchange Traded Fund | ETFs/ ETNs | NaN | - | - | - | - | - | - | 20211230.0 | 2073.5 | 7.917718e+09 | 1.641739e+13 | False |
| 3 | 1308 | 20211230 | Nikko Exchange Traded Index Fund TOPIX | ETFs/ ETNs | NaN | - | - | - | - | - | - | 20211230.0 | 2053.0 | 3.736943e+09 | 7.671945e+12 | False |
| 4 | 1309 | 20211230 | NEXT FUNDS ChinaAMC SSE50 Index Exchange Trade... | ETFs/ ETNs | NaN | - | - | - | - | - | - | 20211230.0 | 44280.0 | 7.263200e+04 | 3.216145e+09 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4412 | 9994 | 20211230 | YAMAYA CORPORATION | First Section (Domestic) | Standard Market | 6100 | Retail Trade | 14 | RETAIL TRADE | 7 | TOPIX Small 2 | 20211230.0 | 2447.0 | 1.084787e+07 | 2.654474e+10 | True |
| 4413 | 9995 | 20211230 | GLOSEL Co.,Ltd. | First Section (Domestic) | Prime Market | 6050 | Wholesale Trade | 13 | COMMERCIAL & WHOLESALE TRADE | 7 | TOPIX Small 2 | 20211230.0 | 410.0 | 2.642680e+07 | 1.083499e+10 | False |
| 4414 | 9996 | 20211230 | Satoh&Co.,Ltd. | JASDAQ(Standard / Domestic) | Standard Market | 6050 | Wholesale Trade | 13 | COMMERCIAL & WHOLESALE TRADE | - | - | 20211230.0 | 1488.0 | 9.152640e+06 | 1.361913e+10 | False |
| 4415 | 9997 | 20211230 | BELLUNA CO.,LTD. | First Section (Domestic) | Prime Market | 6100 | Retail Trade | 14 | RETAIL TRADE | 6 | TOPIX Small 1 | 20211230.0 | 709.0 | 9.724447e+07 | 6.894633e+10 | True |
| 4416 | 25935 | 20211230 | ITO EN,LTD.(shares of preferred stock) | First Section (Domestic) | NaN | 3050 | Foods | 1 | FOODS | - | - | 20211230.0 | 1934.0 | 3.424696e+07 | 6.623362e+10 | False |
4417 rows × 16 columns
# Left-join sector metadata from the stock list onto the ranked submission,
# keeping every submission row even if a code is missing from the stock list
merged_inner = pd.merge(op, sl, on="SecuritiesCode", how="left")
merged_inner = merged_inner[["Date", "SecuritiesCode", "Rank", "Name", "17SectorName"]]
merged_inner
| | Date | SecuritiesCode | Rank | Name | 17SectorName |
|---|---|---|---|---|---|
| 0 | 2021-12-06 | 1301 | 0 | KYOKUYO CO.,LTD. | FOODS |
| 1 | 2021-12-06 | 1332 | 1 | Nippon Suisan Kaisha,Ltd. | FOODS |
| 2 | 2021-12-06 | 1333 | 2 | Maruha Nichiro Corporation | FOODS |
| 3 | 2021-12-06 | 1375 | 3 | YUKIGUNI MAITAKE CO.,LTD. | FOODS |
| 4 | 2021-12-06 | 1376 | 4 | KANEKO SEEDS CO.,LTD. | FOODS |
| ... | ... | ... | ... | ... | ... |
| 111995 | 2022-02-28 | 9990 | 1995 | SAC'S BAR HOLDINGS INC. | RETAIL TRADE |
| 111996 | 2022-02-28 | 9991 | 1996 | GECOSS CORPORATION | COMMERCIAL & WHOLESALE TRADE |
| 111997 | 2022-02-28 | 9993 | 1997 | YAMAZAWA CO.,LTD. | RETAIL TRADE |
| 111998 | 2022-02-28 | 9994 | 1998 | YAMAYA CORPORATION | RETAIL TRADE |
| 111999 | 2022-02-28 | 9997 | 1999 | BELLUNA CO.,LTD. | RETAIL TRADE |
112000 rows × 5 columns
merged_inner.to_csv("Final_submission.csv", index=False)  # drop the DataFrame index from the saved file
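These daily ranks are what the competition metric consumes: each day the top-ranked stocks are notionally bought and the bottom-ranked stocks shorted, with linearly decaying weights, and the final score is the Sharpe ratio of the resulting daily spread returns. A sketch following the published evaluation logic (the function names are illustrative, and `Target` is assumed to be the realized-return column):

```python
import numpy as np
import pandas as pd

def spread_return_per_day(day: pd.DataFrame,
                          portfolio_size: int = 200,
                          toprank_weight_ratio: float = 2.0) -> float:
    """Weighted return of the top-ranked stocks minus the weighted
    return of the bottom-ranked stocks for one trading day."""
    weights = np.linspace(toprank_weight_ratio, 1.0, portfolio_size)
    longs = day.sort_values("Rank")["Target"].head(portfolio_size)
    shorts = day.sort_values("Rank", ascending=False)["Target"].head(portfolio_size)
    long_ret = (longs.to_numpy() * weights).sum() / weights.mean()
    short_ret = (shorts.to_numpy() * weights).sum() / weights.mean()
    return long_ret - short_ret

def sharpe_ratio(df: pd.DataFrame, **kwargs) -> float:
    """Competition-style score: mean / std of the daily spread returns."""
    daily = df.groupby("Date").apply(spread_return_per_day, **kwargs)
    return daily.mean() / daily.std()
```

This is the quantity reported as "Valid Sharpe" in the cross-validation log above.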
%pwd
'F:\\cli\\Data_science_portfolios\\jpx_capstone_projects\\IIM Nagpur Final Project-20220730T163838Z-001\\IIM Nagpur Final Project\\code'
References